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Abstract 

Web  pages  often  contain  clutter  (such  as 
pop-up  ads,  unnecessary  images  and  extraneous 
links)  around  the  body  of  an  article  that  distract 
a  user  from  actual  content.  Extraction  of  “useful 
and  relevant”  content  from  web  pages  has  many 
applications,  including  cell  phone  and  PDA 
browsing,  speech  rendering  for  the  visually 
impaired,  and  text  summarization.  Most 
approaches  to  removing  clutter  or  making 
content  more  readable  involve  changing  font 
size  or  removing  HTML  and  data  components 
such  as  images,  which  takes  away  from  a 
webpage’s  inherent  look  and  feel.  Unlike 
“Content  Reformatting”,  which  aims  to 
reproduce  the  entire  webpage  in  a  more 
convenient  form,  our  solution  directly  addresses 
“Content  Extraction”.  We  have  developed  a 
framework  that  employs  an  easily  extensible  set 
of  techniques  that  incorporate  advantages  of 
previous  work  on  content  extraction.  Our  key 
insight  is  to  work  with  the  Document  Object 
Model  tree,  rather  than  with  raw  HTML 
markup.  We  have  implemented  our  approach  in 
a  publicly  available  Web  proxy  to  extract 
content  from  HTML  web  pages. 

1.  Introduction 

Web  pages  are  often  cluttered  with 
distracting  features  around  the  body  of  an  article 
that  distract  the  user  from  the  actual  content 
they’re  interested  in.  These  “features”  may 
include  pop-up  ads,  flashy  banner 
advertisements,  unnecessary  images,  or  links 
scattered  around  the  screen.  Automatic 


extraction  of  useful  and  relevant  content  from 
web  pages  has  many  applications,  ranging  from 
enabling  end  users  to  accessing  the  web  more 
easily  over  constrained  devices  like  PDAs  and 
cellular  phones  to  providing  better  access  to  the 
web  for  the  visually  impaired. 

Most  traditional  approaches  to  removing 
clutter  or  making  content  more  readable  involve 
increasing  font  size,  removing  images,  disabling 
JavaScript,  etc.,  all  of  which  eliminate  the 
webpage’s  inherent  look-and-feel.  Examples 
include  WPAR  [18L  Webwiper  [19]  and 
JunkBusters  [20],  All  of  these  products  involve 
hardcoded  techniques  for  certain  common  web 
page  designs  as  well  as  “blacklists”  of 
advertisers.  This  can  produce  inaccurate  results 
if  the  software  encounters  a  layout  that  it  hasn’t 
been  programmed  to  handle.  Another  approach 
has  been  content  reformatting  which  reorganizes 
the  data  so  that  it  fits  on  a  PDA;  however,  this 
does  not  eliminate  clutter  but  merely 
reorganizes  it.  Opera  [21],  for  example,  utilizes 
their  proprietary  Small  Screen  Rendering 
technology  that  reformats  web  pages  to  fit 
inside  the  screen  width.  We  propose  a  “Content 
Extraction”  technique  that  can  remove  clutter 
without  destroying  webpage  layout,  making 
more  of  a  page’s  content  viewable  at  once. 

Content  extraction  is  particularly  useful 
for  the  visually  impaired  and  blind.  A  common 
practice  for  improving  web  page  accessibility 
for  the  visually  impaired  is  to  increase  font  size 
and  decrease  screen  resolution;  however,  this 
also  increases  the  size  of  the  clutter,  reducing 
effectiveness.  Screen  readers  for  the  blind,  like 


Report  Documentation  Page 


Form  Approved 
OMB  No.  0704-0188 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 


1.  REPORT  DATE 

2005 


2.  REPORT  TYPE 


3.  DATES  COVERED 


5a.  CONTRACT  NUMBER 


5b.  GRANT  NUMBER 


5c.  PROGRAM  ELEMENT  NUMBER 


5d.  PROJECT  NUMBER 


5e.  TASK  NUMBER 


5f.  WORK  UNIT  NUMBER 


4.  TITLE  AND  SUBTITLE 

DOM-based  Content  Extraction  of  HTML  Documents 


6.  AUTHOR(S) 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS (ES)  8.  PERFORMING  ORGANIZATION 

Defense  Advanced  Research  Projects  Agency ,3701  North  Fairfax  report  number 

Drive,  Arlington, V  A, 22203- 1714 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS  (ES)  10.  SPONSOR/MONITOR’S  ACRONYM(S) 

11.  SPONSOR/MONITOR’S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

The  original  document  contains  color  images. 

14.  ABSTRACT 

Web  pages  often  contain  clutter  (such  as  pop-up  ads,  unnecessary  images  and  extraneous  links)  around  the 
body  of  an  article  that  distract  a  user  from  actual  content.  Extraction  of  "useful  and  relevant"  content 
from  web  pages  has  many  applications,  including  cell  phone  and  PDA  browsing,  speech  rendering  for  the 
visually  impaired,  and  text  summarization.  Most  approaches  to  removing  clutter  or  making  content  more 
readable  involve  changing  font  size  or  removing  HTML  and  data  components  such  as  images,  which  takes 
away  from  a  webpage’s  inherent  look  and  feel.  Unlike  "Content  Reformatting",  which  aims  to  reproduce 
the  entire  webpage  in  a  more  convenient  form,  our  solution  directly  addresses  "Content  Extraction".  We 
have  developed  a  framework  that  employs  an  easily  extensible  set  of  techniques  that  incorporate 
advantages  of  previous  work  on  content  extraction.  Our  key  insight  is  to  work  with  the  Document  Object 
Model  tree,  rather  than  with  raw  HTML  markup.  We  have  implemented  our  approach  in  a  publicly 
available  Web  proxy  to  extract  content  from  HTML  web  pages. 


15.  SUBJECT  TERMS 


16.  SECURITY  CLASSIFICATION  OF: 


a.  REPORT 

unclassified 


b.  ABSTRACT 

unclassified 


c.  THIS  PAGE 

unclassified 


17.  LIMITATION  OF 

18.  NUMBER 

ABSTRACT 

OF  PAGES 

10 

19a.  NAME  OF 
RESPONSIBLE  PERSON 


Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


Hal  Screen  Reader  by  Dolphin  Computer 
Access  or  Microsoft’s  Narrator,  don’t  usually 
automatically  remove  such  clutter  either  and 
often  read  out  full  raw  HTML.  Therefore,  both 
groups  benefit  from  extraction,  as  less  material 
must  be  read  to  obtain  the  desired  results. 

Natural  Language  Processing  (NLP)  and 
information  retrieval  (IR)  algorithms  can  also 
benefit  from  content  extraction,  as  they  rely  on 
the  relevance  of  content  and  the  reduction  of 
“standard  word  error  rate”  to  produce  accurate 
results  [1_3].  Content  extraction  allows  the 
algorithms  to  process  only  the  extracted  content 
as  input  as  opposed  to  cluttered  data  coming 
directly  from  the  web  [14],  Currently,  most 
NLP-based  information  retrieval  applications 
require  writing  specialized  extractors  for  each 
web  domain  [14] [15].  While  generalized  content 
extraction  is  less  accurate  than  hand-tailored 
extractors,  they  are  often  sufficient  [221  and 
reduce  labor  involved  in  adopting  information 
retrieval  systems. 

While  many  algorithms  for  content 
extraction  already  exist,  few  working 
implementations  can  be  applied  in  a  general 
manner.  Our  solution  employs  a  series  of 
techniques  that  address  the  aforementioned 
problems.  In  order  to  analyze  a  web  page  for 
content  extraction,  we  pass  web  pages  through 
an  HTML  parser  that  corrects  the  markup  and 
creates  a  Document  Object  Model  tree.  The 
Document  Object  Model  (www.w3.org/DOM) 
is  a  standard  for  creating  and  manipulating  in¬ 
memory  representations  of  HTML  (and  XML) 
content.  By  parsing  a  web  site's  HTML  into  a 
DOM  tree,  we  can  not  only  extract  information 
from  large  logical  units  similar  to 
Buyukkokten’s  “Semantic  Textual  Units” 
(STUs,  see  [3]  [4]),  but  can  also  manipulate 
smaller  units  such  as  specific  links  within  the 
structure  of  the  DOM  tree.  In  addition,  DOM 
trees  are  highly  editable  and  can  be  easily  used 
to  reconstruct  a  complete  web  site.  Finally, 
increasing  support  for  the  Document  Object 
Model  makes  our  solution  widely  portable. 


2.  Related  Work 

There  is  a  large  body  of  related  work  in 
content  identification  and  information  retrieval 
that  attempts  to  solve  similar  problems  using 
various  other  techniques.  Finn  et  al.  [!]  discuss 
methods  for  content  extraction  from  “single¬ 
article”  sources,  where  content  is  presumed  to 
be  in  a  single  body.  The  algorithm  tokenizes  a 
page  into  either  words  or  tags;  the  page  is  then 
sectioned  into  3  contiguous  regions,  placing 
boundaries  to  partition  the  document  such  that 
most  tags  are  placed  into  outside  regions  and 
word  tokens  into  the  center  region.  This 
approach  works  well  for  single-body 
documents,  but  destroys  the  structure  of  the 
HTML  and  doesn’t  produce  good  results  for 
multi-body  documents,  i.e.,  where  content  is 
segmented  into  multiple  smaller  pieces, 
common  on  Web  logs  (“blogs”)  like  Slashdot 
(http://slashdot.org).  In  order  for  content  of 
multi-body  documents  to  be  successfully 
extracted,  the  running  time  of  the  algorithm 
would  become  polynomial  time  with  a  degree 
equal  to  the  number  of  separate  bodies,  i.e., 
extraction  of  a  document  containing  8  different 
bodies  would  run  in  0(N8),  N  being  the  number 
of  tokens  in  the  document. 

McKeown  et  al.  [8]  [9],  in  the  NLP  group 
at  Columbia  University,  detects  the  largest  body 
of  text  on  a  webpage  (by  counting  the  number 
of  words)  and  classifies  that  as  content.  This 
method  works  well  with  simple  pages. 
However,  this  algorithm  produces  noisy  or 
inaccurate  results  handling  multi-body 
documents,  especially  with  random 
advertisement  and  image  placement. 

Rahman  et  al.  [2]  propose  another 
technique  that  uses  structural  analysis, 
contextual  analysis,  and  summarization.  The 
structure  of  an  HTML  document  is  first 
analyzed  and  then  properly  decomposed  into 
smaller  subsections.  The  content  of  the 
individual  sections  is  then  extracted  and 
summarized.  However,  this  proposal  has  yet  to 


be  implemented.  Furthermore,  while  the  paper 
lays  out  prerequisites  for  content  extraction,  it 
doesn’t  actually  propose  methods  to  do  so. 

A  variety  of  approaches  have  been 
suggested  for  formatting  web  pages  to  fit  on  the 
small  screens  of  cellular  phones  and  PDAs 
(including  the  Opera  browser  [16],  and 
Bitstream  ThunderHawk  [12]);  they  only 
reorganize  the  content  of  the  webpage  to  fit  on  a 
constrained  device  and  require  a  user  to  scroll 
and  hunt  for  content. 

Buyukkokten  et  al.  [311101  define 
“accordion  summarization”  as  a  strategy  where 
a  page  can  be  shrunk  or  expanded  much  like  the 
instrument.  They  also  discuss  a  method  to 
transform  a  web  page  into  a  hierarchy  of 
individual  content  units  called  Semantic  Textual 
Units,  or  STUs.  First,  STUs  are  built  by 
analyzing  syntactic  features  of  an  HTML 
document,  such  as  text  contained  within 
paragraph  (<P>),  table  cell  (<TD>),  and  frame 
component  (<FRAME>)  tags.  These  features 
are  then  arranged  into  a  hierarchy  based  on  the 
HTML  formatting  of  each  STU.  STUs  that 
contain  HTML  header  tags  (<H1>,  <H2>,  and 
<H3>)  or  bold  text  (<B>)  are  given  a  higher 
level  in  the  hierarchy  than  plain  text.  This 
hierarchical  structure  is  finally  displayed  on 
PDAs  and  cellular  phones.  While 
Buyukkokten’ s  hierarchy  is  similar  to  our  DOM 
tree-based  model,  DOM  trees  remain  highly 
editable  and  can  easily  be  reconstructed  back 
into  a  complete  web  site.  DOM  trees  are  also  a 
widely-adopted  W3C  standard,  easing  support 
and  integration  of  our  technology.  The  main 
problem  with  the  STU  approach  is  that  once  the 
STU  has  been  identified,  Buyukkokten,  et  al. 
[3]  [4]  perform  summarization  on  the  STUs  to 
produce  the  content  that  is  then  displayed  on 
PDAs  and  cell  phones.  However,  this  requires 
editing  the  original  content  and  displaying 
information  that  is  different  from  the  original 
work.  Our  approach  retains  all  original  work. 


Kaasinen  et  al.  [5],  discusses  methods  to 
divide  a  web  page  into  individual  units  likened 
to  cards  in  a  deck.  Like  STUs,  a  web  page  is 
divided  into  a  series  of  hierarchical  “cards”  that 
are  placed  into  a  “deck”.  This  deck  of  cards  is 
presented  to  the  user  one  card  at  a  time  for  easy 
browsing.  The  paper  also  suggests  a  simple 
conversion  of  HTML  content  to  WML 
(Wireless  Markup  Language),  resulting  in  the 
removal  of  simple  information  such  as  images 
and  bitmaps  from  the  web  page  so  that  scrolling 
is  minimized  for  small  displays.  While  this 
reduction  has  advantages,  the  method  proposed 
in  that  paper  shares  problems  with  STUs.  The 
problem  with  the  deck-of-cards  model  is  that  it 
relies  on  splitting  a  site  into  tiny  sections  that 
can  then  be  browsed  as  windows.  But  this 
means  that  it  is  up  to  the  user  to  determine  on 
which  cards  the  actual  contents  are  located. 

None  of  the  concepts  solve  the  problem 
of  automatically  extracting  just  the  content, 
although  they  do  provide  simpler  means  in 
which  the  content  can  be  found.  Thus,  these 
concepts  limit  analysis  of  web  sites.  By  parsing 
a  web  site  into  a  DOM  tree,  more  control  can  be 
achieved  while  extracting  content. 

3.  Our  Approach 

Our  solution  employs  multiple 
extensible  techniques  that  incorporate  the 
advantages  of  the  previous  work  on  content 
extraction.  In  order  to  analyze  a  web  page  for 
content  extraction,  the  page  is  first  passed 
through  an  HTML  parser  that  corrects  the 
HTML  and  creates  a  Document  Object  Model 
tree  representation  of  the  web  page.  Once 
processed,  the  resulting  DOM  document  can  be 
seamlessly  shown  as  a  webpage  to  the  end-user 
as  if  it  were  HTML.  This  process  accomplishes 
the  steps  of  structural  analysis  and  structural 
decomposition  done  by  Rahman’s, 
Buyukkokten’ s  and  Kaasinen’ s  techniques  (see 
Section  2).  The  DOM  tree  is  hierarchically 
arranged  and  can  be  analyzed  in  sections  or  as  a 
whole,  providing  a  wide  range  of  flexibility  for 


our  extraction  algorithm.  Just  as  the  approach 
mentioned  by  Kaasinen  et  al.  modifies  the 
HTML  to  restructure  the  content  of  the  site,  our 
content  extractor  navigates  the  DOM  tree 
recursively,  using  a  series  of  different  filtering 
techniques  to  remove  and  modify  specific  nodes 
and  leave  only  the  content  behind.  Each  of  the 
filters  can  be  easily  turned  on  and  off  and 
customized  to  a  certain  degree. 

There  are  two  sets  of  filters,  with 
different  levels  of  granularity.  The  first  set  of 
filters  simply  ignores  tags  or  specific  attributes 
within  tags.  With  these  filters,  images,  links, 
scripts,  styles,  and  many  other  elements  can  be 
quickly  removed  from  the  web  page.  This 
process  of  filtering  is  similar  to  Kaasinen’ s 
conversion  of  HTML  to  WML.  However,  the 
second  set  of  filters  is  more  complex  and 
algorithmic,  providing  a  higher  level  of 
extraction  than  offered  by  the  conversion  of 
HTML  to  WML.  This  set,  which  can  be 
extended,  consists  of  the  advertisement 
remover,  the  link  list  remover,  the  empty  table 
remover,  and  the  removed  link  retainer. 

The  advertisement  remover  uses  an 
efficient  technique  to  remove  advertisements. 
As  the  DOM  tree  is  parsed,  the  values  of  the 
“src”  and  “href’  attributes  throughout  the  page 
are  surveyed  to  determine  the  servers  to  which 
the  links  refer.  If  an  address  matches  against  a 
list  of  common  advertisement  servers,  the  node 
of  the  DOM  tree  that  contained  the  link  is 
removed.  This  process  is  similar  to  the  use  of  an 
operating  systems-level  “hosts”  file  to  prevent  a 
computer  from  connecting  to  advertiser  hosts. 
Hanzlik  [6]  examines  this  technique  and  cites  a 
list  of  hosts,  which  we  use  for  our  advertisement 
remover. 

The  link  list  remover  employs  a  filtering 
technique  that  removes  all  “link  lists”,  which  are 
table  cells  for  which  the  ratio  of  the  number  of 
links  to  the  number  of  non-linked  words  is 
greater  than  a  specific  threshold  (known  as  the 
link/text  removal  ratio).  When  the  DOM  parser 


encounters  a  table  cell,  the  Link  List  Remover 
tallies  the  number  of  links  and  the  number  of 
non-linked  words.  The  number  of  non-linked 
words  is  determined  by  taking  the  number  of 
letters  not  contained  in  a  link  and  dividing  it  by 
the  average  number  of  characters  per  word, 
which  we  preset  as  5  (although  it  may  be 
overridden  by  the  user).  If  the  ratio  is  greater 
than  the  user-determined  link/text  removal  ratio, 
the  content  of  the  table  cell  (and,  optionally,  the 
cell  itself)  is  removed.  This  algorithm  succeeds 
in  removing  most  long  link  lists  that  tend  to 
reside  along  the  sides  of  web  pages  while 
leaving  the  text-intensive  portions  of  the  page 
intact. 

The  empty  table  remover  removes  tables 
that  are  empty  of  any  “substantive”  information. 
The  user  determines,  through  settings,  which 
HTML  tags  should  be  considered  to  be 
substance  and  how  many  characters  within  a 
table  are  needed  to  be  viewed  as  substantive. 
The  table  remover  checks  a  table  for  substance 
after  it  has  been  parsed  through  the  filter.  If  a 
table  has  no  substance,  it  is  removed  from  the 
tree.  This  algorithm  effectively  removes  any 
tables  leftover  from  previous  filters  that  contain 
small  amounts  of  unimportant  information. 

While  the  above  filters  remove  non- 
content  from  the  site,  the  removed  link  retainer 
adds  link  information  back  at  the  end  of  the 
document  to  keep  the  page  browseable.  The 
removed  link  retainer  keeps  track  of  all  the  text 
links  that  are  removed  throughout  the  filtering 
process.  After  the  DOM  tree  is  completely 
parsed,  the  list  of  removed  links  is  added  to  the 
bottom  of  the  page.  In  this  way,  any  important 
navigational  links  that  were  previously  removed 
remains  accessible. 

After  the  entire  DOM  tree  is  parsed  and 
modified  appropriately,  it  can  be  output  in  either 
HTML  or  as  plain  text.  The  plain  text  output 
removes  all  the  tags  and  retains  only  the  text  of 
the  site,  while  eliminating  most  white  space. 
The  result  is  a  text  document  that  contains  the 


main  content  of  the  site  in  a  format  suitable  for 
summarization,  speech  rendering  or  storage. 
This  technique  is  significantly  different  from 
Rahman  et  al.  [2],  which  states  that  a 
decomposed  web  site  should  be  analyzed  to  find 
the  content.  Our  algorithm  doesn’t  find  the 
content  but  eliminates  non-content.  In  this 
manner,  we  can  still  process  and  return  results 
for  sites  that  don’t  have  an  explicit  “main 
body”. 

4.  Implementation 

In  order  to  make  our  extractor  easy  to 
use,  we  implemented  it  as  a  web  proxy 
(program  and  instructions  are  accessible  at 
http://www.psl.cs.columbia.edu/proxvl.  This 
allows  an  administrator  to  set  up  the  extractor 
and  provide  content  extraction  services  for  a 
group.  The  proxy  is  coupled  with  a  graphical 
user  interface  (GUI)  to  customize  its  behavior. 
The  separate  screens  of  the  GUI  are  shown  in 
figures  1,  2,  and  3.  The  current  implementation 
of  the  proxy  is  in  Java  for  cross-platform 
support. 
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Figure  1 


Figure  2 


Figure  3 

While  the  Content  Extraction 
algorithm’s  worst  case  running  time  is  0(N  )  for 
complex  nested  tables;  without  such  nesting,  the 
typical  running  time  is  0(N),  where  N  is  the 
number  of  nodes  in  the  DOM  tree  after  the 
HTML  page  is  parsed.  During  tests,  the 
algorithm  performs  quickly  and  efficiently 
following  the  one-time  proxy  customization. 
The  proxy  can  handle  most  web  pages, 
including  those  with  badly  formatted  HTML, 


because  of  the  corrections  automatically  applied 
while  the  page  is  parsed  into  a  DOM  tree. 

Depending  on  the  type  and  complexity 
of  the  web  page,  the  content  extraction  suite  can 
produce  a  wide  variety  of  output.  The  algorithm 
performs  well  on  pages  with  large  blocks  of  text 
such  as  news  articles  and  mid-size  to  long 
informational  passages.  Most  navigational  bars 
and  extraneous  elements  of  web  pages  such  as 
advertisements  and  side  panels  are  removed  are 
reduced  in  size.  Figures  4  and  5  show  an 
example. 
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Figure  5  -  After 


When  printed  out  in  text  format,  most  of 
the  resulting  text  is  directly  related  to  the 
content  of  the  site,  making  it  possible  to  use 
summarization  and  keyword  extraction 
algorithms  efficiently  and  accurately.  Pages 
with  little  or  no  textual  content  are  extracted 
with  varying  results.  An  example  of  text  format 
extraction  performed  on  the  webpage  in  figure  5 
is  shown  in  figure  6. 


The  initial  implementation  of  the  proxy 
was  designed  for  simplicity  in  order  to  test  and 
design  content  extraction  algorithms.  It  spawns 
a  new  thread  to  handle  each  new  connection, 
limiting  its  scalability.  Most  of  the  performance 
drop  from  using  the  proxy  originates  from  the 
proxy’s  need  to  download  the  entire  page  before 
sending  it  to  the  client.  Future  versions  will  use 
a  staged  event  architecture  and  asynchronous 
callbacks  to  avoid  threading  scalability  issues. 

4.1.  Further  examples 

Figures  7  and  8  show  a  example  of  a 
typical  page  from  www. spacer.com  and  a 
filtered  version  of  that  site,  respectively.  This  is 
another  good  example  of  a  site  that  is  presented 
in  a  content-rich  format.  On  the  other  hand, 


Figures  9  and  10  show  the  front  page  of 
www.planetunreal.com,  a  site  dedicated  to  the 
Unreal  Tournament  2003  first-person  shooter 
game  (www.epicgames.com),  before  and  after 
content  extraction.  Despite  producing  results 
that  are  rich  in  text,  screenshots  of  the  game  are 
also  removed,  which  the  user  might  deem 
relevant  content. 
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released  and  is  available  for  Windows,  Mac,  and  Linux. 
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dedicated  servers,  to  support  the  expected  amount  of  players. 
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we’ll  do  that  right  away. 

For  those  of  you  who  are  new  to  the  game,  we  have  worked  out  a 
nice  features  list.  Those  who  are  already  aeqainted  with  TTR  and 
are  eagerly  awaiting  the  release,  will  find  the  extensive  manual  a 
very  interesting  and  usefull  read,  to  get  more  insight  in  The  Third 
Reich,  giving  you  an  edge  when  you’E  go  to  battle.in  general. 

One  more  thing  before  you  start  downloading  and  enjoying  our  hard 
work :)  We’d  really  like  to  hear  your  feedback  on  the  initial  release, 
since  we  will  evaluate  it.  Since  as  you  have  probably  already  seen 
on  the  ut2003  link  we’re  going  to  start  working  with  the  unreal 
tournament  2003  engine.  However  if  feedback  is  good  and  big 
enough  we  will  continue  to  support  the  UT  version  and  keep  it 
updated  with  new  maps,  models  and  weapons. 

Here’s  the  download  link.  Go  play,  and  run  servers  if  you’re  able!  And  remember, 
TTR  is  for  Unreal  Tournament ,  not  UT2003. 
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Figure  10  -  After 

Figures  11  and  12  show  www.msn.com 
in  its  pre-  and  post-filtered  state.  Since  the  site 
is  a  portal  which  contains  links  and  little  else, 
the  proxy  does  not  find  any  coherent  content  to 
keep.  We  are  investigating  heuristics  that  would 
leave  such  pages  untouched. 


Figure  8  -  After 
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Figure  11  -  Before 
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Figure  12  -  After 


4.2.  Implementation  details 


In  order  to  analyze  a  web  page  for 
content  extraction,  the  page  is  passed  through  an 
HTML  parser  that  creates  a  Document  Object 
Model  tree.  The  algorithm  begins  by  starting  at 
the  root  node  of  the  DOM  tree  (the  <HTML> 
tag),  and  proceeds  by  parsing  through  its 
children  using  a  recursive  depth  first  search 
function  called  filterNodeQ.  The  function 


initializes  a  Boolean  variable  ( mCheckChildren ) 
to  true  to  allow  fdterNode()  to  check  the 
children.  The  currently  selected  node  is  then 
passed  through  a  filter  method  called 
passThroughFilters()  that  analyzes  and  modifies 
the  node  based  on  a  series  of  user-selected 
preferences.  At  any  time  within 
passThroughFilters(),  the  mCheckChildren 
variable  can  be  set  to  false,  which  allows  the 
individual  filter  to  prevent  specific  subtrees 
from  being  filtered.  After  the  node  is  filtered 
accordingly,  fdterNode()  is  recursively  called 
using  the  children  if  the  mCheckChildren 
variable  is  still  true. 

The  filtering  method, 

passThroughFilters() ,  performs  the  majority  of 
the  content  extraction.  It  begins  by  examining 
the  node  it  is  passed  to  see  if  it  is  a  “text  node” 
(data)  or  an  “element  node”  (HTML  tag). 
Element  nodes  are  examined  and  modified  in  a 
series  of  passes.  First,  any  filters  that  edit  an 
element  node  but  do  not  delete  it  are  applied. 
For  example,  the  user  can  enable  a  preference 
that  will  remove  all  table  cell  widths,  and  it 
would  be  applied  in  the  first  phase  because  it 
modifies  the  attributes  of  table  cell  nodes 
without  deleting  them. 


Figure  13.  Architectural  diagram  of  the  system 


The  second  phase  in  examining  element 
nodes  is  to  apply  all  filters  that  delete  nodes 


from  the  DOM  tree.  Most  of  these  filters 
prevent  the  filterNode()  method  from 
recursively  checking  the  children  by  setting 
mCheckChildren  to  false.  A  few  of  the  filters  in 
this  subset  set  mCheckChildren  to  true  so  as  to 
continue  with  a  modified  version  of  the  original 
fdterNode()  method.  For  example,  the  empty 
table  remover  filter  sets  mCheckChildren  to 
false  so  that  it  can  itself  recursively  search 
through  the  <TABLE>  tag  using  a  bottom-up 
depth  first  search  while  fdterNode()  uses  a  top- 
down  depth  first  search.  Finally,  if  the  node  is  a 
text  node,  any  text  filters  are  applied  (there  are 
currently  none,  but  there  may  be  in  the  future). 

4.3.  Implementation  as  an  integrable  framework 

Since  a  content  extraction  algorithm  can 
be  applied  to  many  different  fields,  we 
implemented  it  so  that  it  can  be  easily  used  in  a 
variety  of  cases.  Through  an  extensive  set  of 
preferences,  the  extraction  algorithm  can  be 
highly  customized  for  different  uses.  These 
settings  are  easily  editable  through  the  GUI, 
method  calls,  or  direct  manipulation  of  the 
settings  file  on  disk.  The  GUI  itself  can  also 
easily  be  easily  integrated  (as  a  Swing  JPanel) 
into  any  project.  The  content  extraction 
algorithm  is  also  implemented  as  an  interface 
for  easy  incorporation  into  other  Java  programs. 
The  content  extractor’s  broad  set  of  features  and 
customizability  allow  others  to  easily  add  their 
own  version  of  the  algorithm  to  any  experiment 
or  product. 

5.  Future  Work 

The  current  implementation  in  Java  uses 
a  3rd-party  HTML  parser  to  create  DOM  trees 
from  web  pages.  Unfortunately,  most  of  the 
publicly-available  Java  HTML  parsers  are  either 
missing  support  for  important  features,  such  as 
XHTML  or  dynamically-generated  pages  (ASP, 
JSP).  To  resolve  this,  we  intend  to  support 
commercial  parsers,  such  as  Microsoft’s  HTML 
parser  (which  is  used  in  Internet  Explorer),  in 
the  next  revision.  These  are  much  more  robust 


and  support  a  wider  variety  of  content. 
Integration  will  be  accomplished  by  porting  the 
existing  proxy  to  C#/.NET,  which  will  allow  for 
easy  integration  with  COM  components  (of 
which  the  MS  HTML  parser  is  one). 

We  are  also  working  on  improving  the 
proxy’s  performance;  in  particular,  we  aim  to 
improve  both  latency  and  scalability  of  the 
current  version.  The  latency  of  the  system  will 
be  reduced  by  adopting  a  commercial  parser  and 
by  improving  our  algorithms  to  process  DOM 
incrementally  as  the  page  is  being  loaded. 
Scalability  will  be  addressed  by  re-architecting 
the  proxy’s  concurrency  model  to  avoid  the 
current  thread-per-client  model,  adopting  a 
stage-driven  architecture  instead. 

Finally,  we  are  investigating  supporting 
more  sophisticated  statistical,  information 
retrieval  and  natural  language  processing 
approaches  as  additional  heuristics  to  improve 
the  utility  and  accuracy  of  our  current  system. 

6.  Conclusion 

Many  web  pages  contain  excessive 
clutter  around  the  body  of  an  article.  Although 
much  research  has  been  done  on  content 
extraction,  it  is  still  a  relatively  new  field.  Our 
approach,  working  with  the  Document  Object 
Model  tree  as  opposed  to  raw  HTML  markup, 
enables  us  to  perform  Content  Extraction, 
identifying  and  preserving  the  original  data 
instead  of  summarizing  it.  The  techniques  that 
we  have  employed,  though  simple,  are  quite 
effective.  As  a  result,  we  were  able  to 
implement  our  approach  in  a  publicly-available 
Web  proxy  that  anyone  can  use  to  extract 
content  from  HTML  web  pages  for  their  own 
purposes. 
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